{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas批量拆分Excel与合并Excel\n",
    "\n",
    "实例演示：\n",
    "1. 将一个大Excel等份拆成多个Excel\n",
    "2. 将多个小Excel合并成一个大Excel并标记来源"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "work_dir=\"./course_datas/c15_excel_split_merge\"\n",
    "splits_dir=f\"{work_dir}/splits\"\n",
    "\n",
    "import os\n",
    "if not os.path.exists(splits_dir):\n",
    "    os.mkdir(splits_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 0、读取源Excel到Pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_source = pd.read_excel(f\"{work_dir}/crazyant_blog_articles_source.xlsx\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>title</th>\n",
       "      <th>tags</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2585</td>\n",
       "      <td>Tensorflow怎样接收变长列表特征</td>\n",
       "      <td>python,tensorflow,特征工程</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2583</td>\n",
       "      <td>Pandas实现数据的合并concat</td>\n",
       "      <td>pandas,python,数据分析</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2574</td>\n",
       "      <td>Pandas的Index索引有什么用途？</td>\n",
       "      <td>pandas,python,数据分析</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2564</td>\n",
       "      <td>机器学习常用数据集大全</td>\n",
       "      <td>python,机器学习</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2561</td>\n",
       "      <td>一个数据科学家的修炼路径</td>\n",
       "      <td>数据分析</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     id                 title                    tags\n",
       "0  2585  Tensorflow怎样接收变长列表特征  python,tensorflow,特征工程\n",
       "1  2583   Pandas实现数据的合并concat      pandas,python,数据分析\n",
       "2  2574  Pandas的Index索引有什么用途？      pandas,python,数据分析\n",
       "3  2564           机器学习常用数据集大全             python,机器学习\n",
       "4  2561          一个数据科学家的修炼路径                    数据分析"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_source.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RangeIndex(start=0, stop=258, step=1)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_source.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(258, 3)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_source.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "258"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "total_row_count = df_source.shape[0]\n",
    "total_row_count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 一、将一个大Excel等份拆成多个Excel\n",
    "\n",
    "1. 使用df.iloc方法，将一个大的dataframe，拆分成多个小dataframe\n",
    "2. 将使用dataframe.to_excel保存每个小Excel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1、计算拆分后的每个excel的行数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 这个大excel，会拆分给这几个人\n",
    "user_names = [\"xiao_shuai\", \"xiao_wang\", \"xiao_ming\", \"xiao_lei\", \"xiao_bo\", \"xiao_hong\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "43"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 每个人的任务数目\n",
    "split_size = total_row_count // len(user_names)\n",
    "if total_row_count % len(user_names) != 0:\n",
    "    split_size += 1\n",
    "\n",
    "split_size"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2、拆分成多个dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_subs = []\n",
    "for idx, user_name in enumerate(user_names):\n",
    "    # iloc的开始索引\n",
    "    begin = idx*split_size\n",
    "    # iloc的结束索引\n",
    "    end = begin+split_size\n",
    "    # 实现df按照iloc拆分\n",
    "    df_sub = df_source.iloc[begin:end]\n",
    "    # 将每个子df存入列表\n",
    "    df_subs.append((idx, user_name, df_sub))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3、将每个datafame存入excel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "for idx, user_name, df_sub in df_subs:\n",
    "    file_name = f\"{splits_dir}/crazyant_blog_articles_{idx}_{user_name}.xlsx\"\n",
    "    df_sub.to_excel(file_name, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 二、合并多个小Excel到一个大Excel\n",
    "\n",
    "1. 遍历文件夹，得到要合并的Excel文件列表\n",
    "2. 分别读取到dataframe，给每个df添加一列用于标记来源\n",
    "3. 使用pd.concat进行df批量合并\n",
    "4. 将合并后的dataframe输出到excel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. 遍历文件夹，得到要合并的Excel名称列表"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['crazyant_blog_articles_1_xiao_wang.xlsx',\n",
       " 'crazyant_blog_articles_0_xiao_shuai.xlsx',\n",
       " 'crazyant_blog_articles_5_xiao_hong.xlsx',\n",
       " 'crazyant_blog_articles_4_xiao_bo.xlsx',\n",
       " 'crazyant_blog_articles_3_xiao_lei.xlsx',\n",
       " 'crazyant_blog_articles_2_xiao_ming.xlsx']"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "excel_names = []\n",
    "for excel_name in os.listdir(splits_dir):\n",
    "    excel_names.append(excel_name)\n",
    "excel_names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. 分别读取到dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "crazyant_blog_articles_1_xiao_wang.xlsx xiao_wang\n",
      "crazyant_blog_articles_0_xiao_shuai.xlsx xiao_shuai\n",
      "crazyant_blog_articles_5_xiao_hong.xlsx xiao_hong\n",
      "crazyant_blog_articles_4_xiao_bo.xlsx xiao_bo\n",
      "crazyant_blog_articles_3_xiao_lei.xlsx xiao_lei\n",
      "crazyant_blog_articles_2_xiao_ming.xlsx xiao_ming\n"
     ]
    }
   ],
   "source": [
    "df_list = []\n",
    "\n",
    "for excel_name in excel_names:\n",
    "    # 读取每个excel到df\n",
    "    excel_path = f\"{splits_dir}/{excel_name}\"\n",
    "    df_split = pd.read_excel(excel_path)\n",
    "    # 得到username\n",
    "    username = excel_name.replace(\"crazyant_blog_articles_\", \"\").replace(\".xlsx\", \"\")[2:]\n",
    "    print(excel_name, username)\n",
    "    # 给每个df添加1列，即用户名字\n",
    "    df_split[\"username\"] = username\n",
    "    \n",
    "    df_list.append(df_split)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. 使用pd.concat进行合并"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_merged = pd.concat(df_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(258, 4)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merged.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>title</th>\n",
       "      <th>tags</th>\n",
       "      <th>username</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2120</td>\n",
       "      <td>Zookeeper并不保证读取的是最新数据</td>\n",
       "      <td>zookeeper</td>\n",
       "      <td>xiao_wang</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2089</td>\n",
       "      <td>Mybatis源码解读-初始化过程详解</td>\n",
       "      <td>mybatis</td>\n",
       "      <td>xiao_wang</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2076</td>\n",
       "      <td>怎样借助Python爬虫给宝宝起个好名字</td>\n",
       "      <td>python,爬虫</td>\n",
       "      <td>xiao_wang</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2022</td>\n",
       "      <td>Mybatis源码解读-设计模式总结</td>\n",
       "      <td>mybatis,设计模式</td>\n",
       "      <td>xiao_wang</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2012</td>\n",
       "      <td>打工者心态、主人公意识、个人公司品牌</td>\n",
       "      <td>程序人生</td>\n",
       "      <td>xiao_wang</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     id                  title          tags   username\n",
       "0  2120  Zookeeper并不保证读取的是最新数据     zookeeper  xiao_wang\n",
       "1  2089    Mybatis源码解读-初始化过程详解       mybatis  xiao_wang\n",
       "2  2076   怎样借助Python爬虫给宝宝起个好名字     python,爬虫  xiao_wang\n",
       "3  2022     Mybatis源码解读-设计模式总结  mybatis,设计模式  xiao_wang\n",
       "4  2012     打工者心态、主人公意识、个人公司品牌          程序人生  xiao_wang"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merged.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "xiao_shuai    43\n",
       "xiao_ming     43\n",
       "xiao_wang     43\n",
       "xiao_bo       43\n",
       "xiao_lei      43\n",
       "xiao_hong     43\n",
       "Name: username, dtype: int64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_merged[\"username\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4. 将合并后的dataframe输出到excel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_merged.to_excel(f\"{work_dir}/crazyant_blog_articles_merged.xlsx\", index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
